To address the problems of insufficient image feature extraction and the neglect of both intra-modal relations and the interactions between single-modal and multi-modal representations, a Multi-Modal Deep Fusion (MMDF) model based on text and image information was proposed. First, a Bidirectional Gated Recurrent Unit (Bi-GRU) was used to extract rich semantic features from the text, and a multi-branch Convolutional-Recurrent Neural Network (CNN-RNN) was used to extract multi-level features from the image. Then, inter-modal and intra-modal attention mechanisms were established to capture the high-level interactions between the language and vision domains, yielding a multi-modal joint representation. Finally, the original representation of each modality and the fused multi-modal joint representation were re-fused according to their attention weights, strengthening the role of the original information. Compared with the Multimodal Variational AutoEncoder (MVAE) model, the proposed model improves accuracy by 1.9 percentage points on the China Computer Federation (CCF) competition dataset and by 2.4 percentage points on the Weibo dataset. Experimental results show that the proposed model fully fuses multi-modal information and effectively improves the accuracy of false information detection.
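The inter-modal attention and attention-weighted re-fusion steps described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the feature shapes, the mean-pooled joint representation, and the scalar-score parameterisation of the re-fusion weights are all assumptions made for the sake of a self-contained example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, key_value):
    # Scaled dot-product attention: `query` rows attend over `key_value` rows.
    # Shapes: query (n_q, d), key_value (n_k, d) -> context (n_q, d).
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)   # (n_q, n_k)
    weights = softmax(scores, axis=-1)
    return weights @ key_value                  # (n_q, d)

def attention_refuse(text_repr, image_repr, joint_repr, w):
    # Re-fuse the original modal representations with the joint one.
    # `w` (3, d) is a hypothetical stand-in for the model's learned
    # parameters producing one attention score per representation.
    reps = np.stack([text_repr, image_repr, joint_repr])  # (3, d)
    scores = (reps * w).sum(axis=-1)                      # (3,)
    alpha = softmax(scores)                               # attention weights
    return alpha @ reps                                   # weighted sum, (d,)

rng = np.random.default_rng(0)
d = 8
text_feats = rng.normal(size=(5, d))   # stand-in for Bi-GRU token features
image_feats = rng.normal(size=(4, d))  # stand-in for CNN-RNN region features

# Inter-modal attention: text attends to image regions, and vice versa.
text_ctx = cross_modal_attention(text_feats, image_feats).mean(axis=0)
image_ctx = cross_modal_attention(image_feats, text_feats).mean(axis=0)
joint = (text_ctx + image_ctx) / 2     # simple joint representation

fused = attention_refuse(text_feats.mean(axis=0),
                         image_feats.mean(axis=0),
                         joint,
                         rng.normal(size=(3, d)))
print(fused.shape)  # (8,)
```

The final `fused` vector corresponds to the re-fused representation that would be fed to the false-information classifier; in the actual model, the attention parameters are learned end-to-end rather than drawn at random.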